3 Main Functions
PATO has two main data classes, mash and mmseq. MASH is a whole-genome comparison tool that can compare thousands of genomes in a few minutes (even seconds).
On other hand, MMSeq2 is a ultra fast software for sensitive search and clustering genes or proteins. Since last versions of MMSeq2 it include lin-clust method that allow to cluster huge data sets in linear time (most of the method are quadratic time or log(n)²) but with high identities.
Both MASH and MMSeq2 allow multi-threading to accelerate the processing.
3.1 MASH
Mash is a method for “Fast genome and metagenome distance estimation using MinHash”(). MASH accept nucleotide or aminoacid fasta files and estimated a similarity distance. To extend the information we recommend to read the main paper Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1) :132. doi: 10.1186/s13059-016-0997-x
To create a mash object you need to supplie a list (a vector indeed) of files. By default mash use Amino Acids fasta files.
my_mash <- mash(my_list, type ="prot")
Besides, you can change some parameter as n_cores, sketch, kmer and type.
my_mash <- mash(my_list, n_cores = 20, sketch = 1000, kmer = 21, type = "prot")
3.2 MMSeqs2
Creators of MMSeqs define their software as:
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.
To create a mmseq object you just have to run:
my_mmseq <- mmseqs(my_list)
You can change default parameter as coverage, identity or evalue among n_cores. Also you can set especifict parameters of clustering (searching for protein families and orthologous). These parameters are cov_mode and cluster_mode.
my_mmseq <- mmseqs(my_list, coverage = 0.8, identity = 0.8, evalue = 1e-6, n_cores = 20, cov_mode = 0, cluster_mode = 0)
The mmseq object is key in the analysis of the pangenome definition (core, accessory, pangenome) as well as other process such us annotation or population structure (if you are using accessory to defined it).
3.2.1 Coverage Mode
(this section has been extracted from https://github.com/soedinglab/MMseqs2/wiki)
MMseqs2 has three modes to control the sequence length overlap “coverage”: (0) bidirectional, (1) target coverage, (2) query coverage and (3) target-in-query length coverage. In the context of cluster or linclust, the query is seen representative sequence and target is a member sequence. The -cov-mode flag also automatically sets the cluster_mode.
3.2.2 Clustering modes
All clustering modes transform the alignment results into an undirected graph. In this graph notation, each vertex (i.e. node) represents a sequence, which is connected to other sequences by edges. An edge between a pair of sequences is introduced if the alignment criteria (e.g. identity, coverage and evalur) are fulfilled.
The Greedy Set cover (cluster-mode = 0) algorithm is an approximation for the NP-complete optimization problem called set cover.
Set Cover clustering
Greedy set cover works by interactively selecting the node with most connections and all its connected nodes to form a cluster and repeating until all nodes are in a cluster.
The greedy set cover is followed by a reassignment step. A Cluster member is assigned to another cluster centroid if their alignment score was higher.
Connected component (cluster_mode = 1) uses transitive connection to cover more remote homologous.
Connected component clustering
In connected component clustering starting at the mostly connected vertex, all vertices that are reachable in a breadth-first search are members of the cluster.
Greedy incremental (cluster_mode = 2) works analogous to CD-HIT clustering algorithm.
Greedy incremental clustering
Greedy incremental clustering takes the longest sequence (indicated by the size of the node) and puts all connected sequences in that cluster, then repeatedly the longest sequence of the remaining set forms the next cluster.
3.3 AccNET
Accnet is a representation of the “accessory genome” using a bipartite network. (see more in Lanza, V. F., Baquero, F., de la Cruz, F. & Coque, T. M. AcCNET (Accessory Genome Constellation Network): comparative genomics software for accessory genome analysis using bipartite networks. Bioinformatics 33, 283–285 (2017)). Accnet creates a graph using two kind of nodes: genomes and proteins/gene. One genome is connected to one proteins if the genome has this protein (i.e. a protein that belong to this protein family/cluster). The accnet object is other of the main data type in PATO. PATO can create some representations of the accessory genome (dendrograms, networks, plots….) and has some functions to analyze the accessory genome properly. The function accnet() builds an accnet object taking into account the maximun frequency of each protein/gene to be considered as part of accessory genome. Moreover, the user can decide to include single protein/genes or not (those which are only present in one sample).
Accnet depends of the definition of protein families so the input of the accnet() function is a mmseqs object. The definition of the homologous cluster is critical to define the accessory genome.
accnet() can be called as:
my_accnet <- accnet(my_mmseq, threshold = 0.8, singles = TRUE)
or piping the mmseqs() function
my_accnet <- mmseqs(my_files) %>% accnet(threshold = 0.8)
3.4 Classes
PATO has been designed as a toolkit. Most of the functions use as input other outputs functions, creating a complete workflow to analysis large genome datasets. As R package, PATO can interact with other packages as ape, vegan, igraph or pheatmap. We have tried to developed PATO compatible with tidyverse packages in order to use %>% pipe command and to be compatible with ggplot2 and extensions (as ggtree).
In order to create a robust environment PATO uses specific data classes (S3 objects) to assure the compatibility among functions. PATO has 4 object classes:
mashobjectmmseqobjectaccnetobjectnr_listobjectgff_listobjectcore_genomeobjectcore_snps_genomeobjectaccnet_enrobject
Other outputs take external objects as igraph or ggplot.
The idea to have these object is to share a structured data among functions. These objects can be inspected and some of their data can be used with external function/packages.
3.4.1 mash class
A mash is a list of two element: table, matrix.
The first one, my_mash$table, is a square and symmetric matrix with the distances among genomes. As a matrix has genomes as rownames and colnames
The second one, my_mash$matrix is a data.table/data.frame with all the distances as list. The table has the columns c("Source","Target","Dist")
3.4.2 mmseq class
A mmseq object is a list of two elements: table and annot.
First element, my_mmseq$table, includes a data.table/data.frame with four columns c("Prot_genome", "Prot_Prot", "Genome_genome", "Genome_Prot"). This is the output of MMSeqs2 and described the clustering of the input genes/proteins. First column referees to the genome that contain the representative gene/protein of the cluster. Second one, is the representative protein of the cluster (i.e. the cluster name). Third column is the genome that includes the gene/protein of the fourth column.
In the second element, my_mmseq$annot, we can find a data.frame/data.table with the original annotation of all representative gene/protein of each cluster in two columns. The first one Prot_prot is the same that the second one of the first element.
3.4.3 accnet class
An accnet object is a list of three elements: list, matrix, annot.
First element, my_accnet$list, includes a data.frame/data.table with the columns c("Source","Target","degree") the correspondences among proteins/genes and genomes and the degree of the corresponding protein/gene. In some cases, for example, when the accnet object cames from accnet_with_padj function or pangenomes the third column can be different and includes information such us the frequency of the protein/gene in the pangenome (in the case of pangenomes) , or the p-value of the association between the genome and the protein/gene (in the case of accnet_with_padj)
Second element, my_accnet$matrix, is a presence/absence matrix. It is very important to know that it has a column Source, so in the case of use the matrix out of PATO you should convert this column in rownames (my_matrix <- my_accnet$matrix %>% column_to_rownames("Source"))
3.4.4 nr_list class
A nr_list is a S3 object of three elements: Source,centrality and cluster. nr_list can be coerced to a data.frame of three columns using as.data.frame function. Element Source has the accession/name of the genomes, centrality has the centrality value of the genome, according to the degrees of vertices, for its cluster. Finally, cluster has the membership of the Source genome.
3.4.5 gff_list class
A gff_list is an object that store the results of load_gff_list() output. It store the path to the GFF, FNA, FFN and FAA files. A gff_list object can be used as input for classifier, mash, mmseq, core_snp_genome and snps_pairwaise.
3.4.6 core_genome and core_snps_genome
These are objects produced by the core_ functions. They contains the alignment information (core or core_snps) and can be exported to FASTA format or used in other functions.
3.4.7 accnet_enr
This object is a data.frame with the result of accnet_enrichment_analysis(). That object can be exported as a gephi file with the p-values (adjusted) as a edge-weight.